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1. Introduction 

In order to accelerate numerical simulations in lattice QCD, different preconditioning tech- 
niques are being used. We will try to make a comparison between two popular ways of precon- 
ditioning the Hybrid Monte Carlo (HMC) for improved Wilson fermions: domain decomposi- 
tion introduced by Liischer in the DD-HMC algorithm[|l]] and Hasenbusch's mass preconditioning 
(MP)[Q]. In both approaches, the quark determinant is factorized into a part which is dominated 
by the infra-red and another which is largely ultra-violet. This leads to the reduction of the quark 
force magnitude in the molecular dynamics equations. Therefore, the associated integration step 
sizes can be set to larger values, which gives an acceleration of the algorithm. 

Comparisons between algorithms are difficult. In a modern lattice QCD computation, many 
parameters have to be set. Since their number can go into the dozens — their optimal values de- 
pending on each other — it is virtually impossible to find the minimum, at which each algorithm 
performs best, and then make a true comparison. In particular, the performance is determined by 
the auto-correlation times of the observables of interest, whose measurements require runs with 
high statistics. Furthermore, the optimal values of the parameters and the relative performance of 
the algorithms might also depend on the implementation and the computer the simulation is run on. 

At least from the point of view of implementation, we tried to have the comparison on virtually 
equal grounds: for the DD-HMC, we used Liischer's publicly available code 1 and from it, we 
developed an implementation of the mass preconditioned HMC, reusing as many building blocks 
as possible. In particular the locally deflated solver introduced in the DD-HMC framework^ turns 
out to greatly speed up the simulations in both algorithmic setups. 

The block decomposition, on which the DD-HMC is based, allows for a decoupling of the 
blocks for a large part of the forces in such a simulation. The links on the boundary are not updated 
during the trajectory, which results in reduced communication, and is therefore suitable for clusters 
with fast nodes and a relatively slow connection. However, this also means that only a fraction R 
of the links is "active". The naive expectation that the auto-correlation times grow proportional 
to R 1 is confirmed in pure gauge theory [EJ, however, with dynamical fermions the cost is not 
reduced accordingly. There is a competition between the need of the computer, which are small 
blocks using many processors of massively parallel machines, and the need of the physics, which 
asks for blocks of a physical size of at least 0.5fm to provide an efficient preconditioning. The size 
of the blocks will determine the relative size of the block force and the global remainder. 

In mass preconditioning, the fermion matrix is preconditioned (basically) by one with a fermion 
of a larger mass. The value of its mass plays the role of the size of the block in domain decomposi- 
tion, but it is continuous and therefore can be tuned. Also more than one preconditioning fermion 
can be easily used. 

Both algorithms have been compared in [Q], however, new to the present study is the use of the 
deflated solver in both algorithms. The DD-HMC is well documented in the literature, so we will 
only describe our setup for the MP-HMC algorithm in Section ||] and and continue to our test in 
Sect. § |and |. 

1 http://cern.ch/luscher/dd-hmc 
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2. Action and algorithms 

We are simulating Nf = 2 degenerate flavours of non-perturbatively 0(a) improved Wilson 
quarks, using the Wilson gauge action. The Dirac operator in this formulation is given by 

D(m) =D w + c sw J^^ i v ^(J tlv F^ v + m , (2.1) 

where D\y represents the unimproved Wilson Dirac operator without mass term, c sw is the improve- 
ment coefficient and m the bare quark mass. 

In our implementation of MP-HMC, we use mass preconditioning for the Schur complement 
of the symmetric even-odd preconditioning. Starting from the standard decomposition, 

det Q = det(y 5 D) = det Q ee det Q 00 det Q s with Q s = 1 - Q' 1 Q eo Qoo Qoe , (2.2) 

we write 

det£ = detQ ee detQ l)0 det \W(Am)] det [W~\Am)Q s ] (2.3) 
where W (Am) = Qs + Am with Am > 0. This leads to the effective action 

S eff = -2(logdet<2 ee + logdet<2„ ) + \W~\hm) | 2 + |(1 + ^)<*>2| 2 • (2.4) 

Using again a Schur complement approach, the inverse of the preconditioned operator Qs can be 
constructed from the inverse of the full Hermitian Dirac operator Q 

Qf =PeQT X Q ee P e . (2.5) 

In the following, the forces associated with pseudo-fermion field are denoted by F\, whereas Fi 
are the forces from <I>2. 

In the DD-HMC, the quark determinant is written as the product of the determinants of the 
Dirac operator restricted to the blocks and a factor which accounts for the remaining contributions 
to the fermion determinant. The latter factor couples the gauge fields on the different blocks. The 
block forces are referred to as F\ and F? is the block interaction force. For details of this setup see 

El 

In both setups, UV/IR separation due to the preconditioning opens the possibility to integrate 
F2 on a larger time scale than F\ . 

3. Simulation parameters 

We have performed runs on a 48 x 24 3 lattice at j3 = 5.3 and c sw = 1.90952, corresponding to 
the lattice spacing in physical units of a ?» 0.07 lfm from ro = 0.5fm The hopping parameter 
K"sea = 0.13625 corresponds to a pion mass of m n 420MeV and m K L « 3.6. In all our runs, the 
trajectory length is set to T = 0.5, for which we take the normalization of Ref. [|l|]. These parameters 
were also used in DD-HMC simulations without deflation in Ref. [|7|]. 

In DD-HMC, the locally deflated Schwarz-preconditioned generalized conjugate residual (GCR) 
solver described in [Q] is employed for the inversions in F%. The less expensive inversions on the 
blocks, needed in the computation of F\, are done with the BiCGstab algorithm. In the multiple 
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without defl. 


with defl. 




107 


23 


comp. timefT^) 


780.42s 


264.98 s 



Table 1: Average number of GCR iterations per trajectory and the total execution time for the F2 computation 
in MP-HMC with the deflated solver and without the application of deflation. The size of the deflation blocks 
is 6 2 x 4 2 . Note that a reduction in average time by a factor ^ 3 is achieved, taking into account the time 
needed for the construction of the deflation subspace. 



time scale integration we use A^i steps in the fermion force F\ for each of the N2 steps per trajectory 
of F2 and analogously No steps of the gauge force per step in F\. We choose N2, N\ and No to be 
18,5 and 6, respectively, for a block size of 6 2 x 12 . 

Without much tuning, for the MP case we have taken the same step size at the outer force 
and the rest of the parameters is chosen according to the ratio of the forces magnitude, i.e. ||/^[| : 
\\F\ || : ||Fo|| = 1 : 9 : 36. To be on the safe side, we have chosen N%, N\, No = 18, 10, 10. In our 
version of MP-HMC, the Schwarz-preconditioned GCR solver is employed for the computation 
of both F\ and F2. The demanding inversions with the low quark mass in the F2 computation are 
done with the deflated version of the solver whereas in the F\ computations the deflation was off. 
Here, the preconditioning parameter is the positive mass difference added to the symmetrically 
preconditioned Dirac operator which we set to Am pa 0.09. More tuning could probably lead to a 
better performance than the one discussed below. 

In both setups, the standard leap frog integration is implemented and for the prediction of the 
solution in all inversions, the chronological inversion method of Brower et al. [^] is used. 



4. Solver performance 

The Schwarz-preconditioned GCR solver used for all the inversions in MP-HMC is taken 
over from the DD-HMC environment. The results of intensive tests of this solver implementation 



within DD-HMC can be found in |10J. As expected, we find the improved performance of the 
deflated solver also in the MP-HMC, gaining roughly a factor of three in the time needed for the 
computation of F2 on our lattices compared to the case without the application of deflation, details 
can be found in Table [j] 

For the update of the deflation subspace the same criteria as in the DD-HMC setup were 
applied [[RJ], however, since in MP-HMC all links are active, the deflation subspace has to be up- 
dated more often than in the DD-HMC case to satisfy the same update criteria, see Fig. [j] In 
principle, we could have optimized the update criterion of the deflation space even further to match 
the particular conditions in the MP-HMC given by the cost of the F2 force computation, but we have 
already achieved a very significant gain and do not expect any further improvements to dramatically 
change the situation. 



5. Quark forces and stability 

At small quark masses, instabilities in the numerical integration of the molecular dynamics 
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Figure 1: History of the iteration numbers AfccR of the deflated Schwarz-preconditioned GCR solver along 
a molecular-dynamics trajectory, using the DD-HMC (left side) and MP-HMC (right side) algorithm, plotted 
against the molecular-dynamics time T given in units of the integration step size £2 = -m- The lattice size 
in both cases is 48 x 24 3 and K" sea = 0.13625. The vertical lines indicate the refreshment of the deflation 
subspace. 

equations may occur, which manifest themselves in violent fluctuations in the energy violation AH 
of the molecular dynamics evolution. According to the tests of the DD-HMC algorithm performed 
so far, severe integration instabilities were rare [f7|]. This has to be demonstrated also for the MP- 
HMC and for both algorithms when going to smaller quark masses. 

Typically, the force F2 is the source of these instabilities, and in Fig. ^| we show Monte Carlo 
time histories for F\ and F2 showing the maximal and average forces, together with the correspond- 
ing AH throughout roughly 600 trajectories for DD-HMC and 300 in the MP-HMC case. One can 
see that the average force in the case of MP-HMC fluctuates a lot less than in DD-HMC, which is 
reflected in smaller magnitude and fluctuations in AH for the mass preconditioning. 

6. Efficiency of the algorithms 

The question of performance of the two algorithms must be addressed empirically and al- 
though a final answer would require to include in the comparison a series of lattice sizes and quark 
masses, our study could give us a first insight into how the two approaches relate in computa- 
tional cost. This includes not only the cost of performing a single trajectory, but also the related 
auto-correlation times, because what matters is the cost of achieving a certain statistical error. The 
presented computations are performed on the SGI Altix which is built of Intel Xeon Gainestown 
X5570 CPUs with InfiniBand connections at HLRN supercomputing system at ZIB in Berlin and 
RRZN in Hannover, and the relative timings can certainly be different on other architectures. 

In Table g we show the average plaquette value in both runs, integrated autocorrelation time 
Tint an d the acceptance rate for the two cases. As we have already mentioned, the fraction of active 
links in the DD-HMC for the chosen parameters is R = 36%, increasing it would mean to run 
with fewer than 32 MPI processes. Note, however, that this is partially due to the relatively small 
L/a = 24. It has been shown for the pure gauge theory that the autocorrelation time scales with 
the inverse of this fraction[Q] as long as the blocks are of a reasonable size. In order to be able to 
compare the two algorithms, we have multiplied the current result for the integrated autocorrelation 
time with R, and scaled the execution time accordingly. Since the available statistics is not enough 
for the reliable estimation of the errors in the autocorrelation times, the values for the Ti nt and its 
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wall clock / R 


Up 


x int {U P )xR 


N,raj 


acc. rate 


DD-HMC 


2010s 


1.65106(10) 


10(5) 


840 


90(1)% 


MP-HMC 


1530s 


1.65127(10) 


10(4) 


432 


85(2)% 



Table 2: Comparison of the DD-HMC algorithm with the MP-HMC. Both simulations are done for the 
improved Wilson theory with two degenerate fermion flavours. The lattice size is 48 x 24 3 , lattice spacing 
a « 0.07 fm and the pion mass m n « 400 MeV. The block size in DD-HMC is 6 2 x 12 2 , while in the MP case, 
the difference in mass Am ~ 0.09. Here R represents the fraction of active links in the algorithm, R = 0.36 
for DD-HMC and R = 1 for MP-HMC. The two algorithm demonstrate comparable performance. 



error should be taken only as first estimates. Including the acceptance into consideration, we can 
conclude from the performed study that in both approaches roughly the same CPU time is needed 
for the same error in the measured observable Up. 

7. Summary 

In this contribution, we have compared DD-HMC, one of the most efficient algorithms for 
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Figure 2: Histories of the energy violation AH, as well as maximum and average forces F2 and F\, for 
each force update, plotted as a function of the trajectory number. Values corresponding to the DD-HMC 
algorithm are shown to the left and the integration stepsizes for the two forces relate as Af 2 : At\ = 1 : 6. The 
values for MP-HMC are shown in the two right panels and the corresponding ratio of the integration steps is 
At\ : At 2 = 1 : 10. The lattice size is 48 x 24 3 and fc sea = 0.13625. 
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dynamical QCD simulations, with our implementation of a mass preconditioned HMC, including 
for the first time a locally deflated solver, which brings a significant speedup. We have confirmed 
that relatively large step size for small quark mass is achievable also with MP-HMC. Looking at the 
series of average and maximal forces of F2 in both cases, it is indicative that the energy violations 
visible as spikes in AH are caused by the irregularities in the forces. This is easier to control with 
a continuous parameter, such as the preconditioning mass in mass preconditioning case, than with 
the HMC block size in DD-HMC which can only take few values in practical applications. The 
two algorithms have demonstrated comparable performance, which is in large owed to the usage of 
the same efficient deflated solver in both cases. 

A future task is to extend this study to larger lattices with even lower quark masses. The MP- 
HMC also leaves room for improvement. One could study the use of three or more pseudo-fermion 
fields and also tune the preconditioning masses more carefully than we did in the present results. 
In particular for smaller preconditioning masses, the use of the deflated solver also for F\ might be 
advisable. 

Besides being able to use larger numbers of CPUs, the approach with mass preconditioning 
allows for a much easier extension of the program package, such as introducing Schrodinger func- 
tional boundary conditions, as well as adding additional heavy and non-degenerate flavours. 
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